feat(page-cluster): add complete-linkage structural clustering within a block #906

Merged
YusukeHirao merged 2 commits into dev from feat/page-cluster-resolve-structural-cluster-keys on Jul 4, 2026

@YusukeHirao

Member

Summary

  • Add resolveStructuralClusterKeys, which clusters pages already grouped into one
    blocking key (e.g. from resolveBlockingGroupKeys) by structural similarity, using
    complete-linkage hierarchical clustering computed via the NN-chain algorithm (O(n²)).
  • Single-linkage (connected components of a similarity-threshold graph) was considered
    first but rejected: its "chaining" failure mode lets one unrepresentative page
    transitively merge two otherwise-unrelated templates, defeating template detection.
    Complete-linkage requires every pair across two clusters to clear the threshold,
    ruling that out, at no extra asymptotic cost since the pairwise similarity matrix is
    computed either way.
  • MinHash/LSH approximation is intentionally out of scope: NN-chain already computes
    the exact clustering in O(n²), and real-data validation (see test plan) confirmed
    O(n²) is fast enough at the largest real block size found (~1,400 pages, ~685ms).
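
The approach described above can be sketched as follows. This is a minimal illustration, not the code merged in this PR: `completeLinkageLabels`, its input shape (a symmetric pairwise similarity matrix), and its return shape (one label per page) are assumptions made for the example. It builds the full merge sequence with NN-chain first and applies the threshold afterwards, rather than stopping early.

```typescript
// Hypothetical helper, not the PR's resolveStructuralClusterKeys.
// `sim` is assumed to be a symmetric matrix of pairwise similarities in [0, 1].
function completeLinkageLabels(sim: number[][], threshold: number): number[] {
  const n = sim.length;
  const s = sim.map((row) => [...row]); // cluster-to-cluster similarity, updated in place
  const active = new Array<boolean>(n).fill(true);
  const merges: { a: number; b: number; sim: number }[] = [];
  const chain: number[] = [];

  // Each pass of the outer loop performs exactly one merge.
  for (let remaining = n; remaining > 1; remaining--) {
    if (chain.length === 0) chain.push(active.indexOf(true));
    for (;;) {
      const top = chain[chain.length - 1];
      const prev = chain.length > 1 ? chain[chain.length - 2] : -1;
      // Nearest neighbour of `top`; ties go to `prev` so the chain cannot cycle.
      let best = prev;
      let bestSim = prev === -1 ? -Infinity : s[top][prev];
      for (let k = 0; k < n; k++) {
        if (active[k] && k !== top && s[top][k] > bestSim) {
          best = k;
          bestSim = s[top][k];
        }
      }
      if (best !== prev) {
        chain.push(best);
        continue;
      }
      // `top` and `prev` are reciprocal nearest neighbours: merge `top` into `prev`.
      chain.length -= 2;
      merges.push({ a: prev, b: top, sim: bestSim });
      active[top] = false;
      for (let k = 0; k < n; k++) {
        if (active[k] && k !== prev) {
          // Complete linkage: a merged cluster is only as similar as its weakest pair.
          s[prev][k] = s[k][prev] = Math.min(s[prev][k], s[top][k]);
        }
      }
      break;
    }
  }

  // Cut the dendrogram: keep only merges whose weakest pair clears the threshold.
  const parent = Array.from({ length: n }, (_, i) => i);
  const find = (x: number): number => {
    while (parent[x] !== x) {
      parent[x] = parent[parent[x]];
      x = parent[x];
    }
    return x;
  };
  for (const m of merges) {
    if (m.sim >= threshold) parent[find(m.b)] = find(m.a);
  }
  return parent.map((_, i) => find(i));
}

// Page 1 resembles both page 0 and page 2, but pages 0 and 2 are unrelated.
const chainedSim = [
  [1, 0.95, 0.1],
  [0.95, 1, 0.9],
  [0.1, 0.9, 1],
];
const chainedLabels = completeLinkageLabels(chainedSim, 0.8);
```

With these inputs, pages 0 and 1 share a label while page 2 stays separate, because the pair (0, 2) does not clear the threshold.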

Test plan

  • yarn build / yarn lint / yarn test pass
  • /code-review xhigh run; 4 findings (floating-point threshold boundary epsilon,
    unicorn/prefer-math-trunc, Prettier formatting, cspell) all fixed
  • /qa-engineer review added a differential test against a naive brute-force
    complete-linkage reference, which caught one real algorithm bug during
    development (an early exit on the threshold that is unsound when multiple
    disjoint NN-chains exist), plus a regression test for the epsilon fix and
    threshold boundary tests
  • /product-manager and /doc reviews passed with one JSDoc gap fixed
  • Real-data validation against two production crawl archives confirmed performance
    at the largest observed real block size. It also surfaced a documented,
    non-blocking finding for a future iteration: per-page body classes on some real
    sites can zero out structural similarity between same-template pages. This is a
    caller-side normalization concern, not a bug in this function
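
For readers unfamiliar with the differential-test setup, a naive reference of the kind mentioned above might look like the sketch below. The names `bruteForceCompleteLinkage` and `meetsThreshold` and the tolerance value `EPSILON = 1e-9` are assumptions for illustration; the PR does not state the names or the epsilon it uses. Because this version always picks the globally best pair, stopping at the first pair that fails the threshold is sound here.

```typescript
// Hypothetical tolerance; the actual value used in the PR is not stated.
const EPSILON = 1e-9;

// Threshold check that tolerates floating-point rounding at the boundary.
function meetsThreshold(similarity: number, threshold: number): boolean {
  return similarity >= threshold - EPSILON;
}

// Naive reference: repeatedly merge the two clusters whose weakest
// cross-cluster pair is strongest, until no pair of clusters clears the threshold.
function bruteForceCompleteLinkage(sim: number[][], threshold: number): number[][] {
  const clusters: number[][] = sim.map((_, i) => [i]);
  for (;;) {
    let bestI = -1;
    let bestJ = -1;
    let bestSim = -Infinity;
    for (let i = 0; i < clusters.length; i++) {
      for (let j = i + 1; j < clusters.length; j++) {
        // Complete linkage: score a pair of clusters by its weakest page pair.
        let weakest = Infinity;
        for (const p of clusters[i]) {
          for (const q of clusters[j]) weakest = Math.min(weakest, sim[p][q]);
        }
        if (weakest > bestSim) {
          bestI = i;
          bestJ = j;
          bestSim = weakest;
        }
      }
    }
    if (bestI === -1 || !meetsThreshold(bestSim, threshold)) break;
    clusters[bestI] = [...clusters[bestI], ...clusters[bestJ]];
    clusters.splice(bestJ, 1);
  }
  return clusters.map((c) => [...c].sort((a, b) => a - b));
}

// 0.1 + 0.7 evaluates to 0.7999999999999999 in IEEE 754 doubles,
// so a strict `>= 0.8` comparison would reject this pair.
const boundary = 0.1 + 0.7;
const referenceSim = [
  [1, boundary, 0.2, 0.2],
  [boundary, 1, 0.2, 0.2],
  [0.2, 0.2, 1, 0.9],
  [0.2, 0.2, 0.9, 1],
];
const referenceClusters = bruteForceCompleteLinkage(referenceSim, 0.8);
```

A differential test would then compare this output, as a partition of page indices, with the output of the optimized implementation over many generated matrices.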

… a block

Add resolveStructuralClusterKeys, which clusters pages already grouped into
one blocking key (e.g. from resolveBlockingGroupKeys) by structural
similarity, using complete-linkage hierarchical clustering computed via the
NN-chain algorithm for O(n^2) time.

Single-linkage (connected components of a similarity-threshold graph) was
considered first but rejected: its "chaining" failure mode lets one
unrepresentative page transitively merge two otherwise-unrelated templates,
which defeats template detection. Complete-linkage requires every pair
across two clusters to clear the threshold, ruling that out, at no extra
asymptotic cost since the pairwise similarity matrix is computed either way.
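
The chaining failure described above is easy to reproduce on a toy input. The sketch below is illustrative only: `singleLinkageLabels` is a hypothetical helper that computes connected components of the similarity-threshold graph with union-find, and the similarity values are made up. Pages 0 and 1 stand in for one template, pages 3 and 4 for another, and page 2 is an unrepresentative page that resembles both.

```typescript
// Hypothetical single-linkage helper: connected components of the graph
// whose edges are the page pairs with similarity >= threshold.
function singleLinkageLabels(sim: number[][], threshold: number): number[] {
  const parent = sim.map((_, i) => i);
  const find = (x: number): number => {
    while (parent[x] !== x) {
      parent[x] = parent[parent[x]];
      x = parent[x];
    }
    return x;
  };
  for (let i = 0; i < sim.length; i++) {
    for (let j = i + 1; j < sim.length; j++) {
      if (sim[i][j] >= threshold) parent[find(j)] = find(i);
    }
  }
  return parent.map((_, i) => find(i));
}

// Made-up similarities: templates {0, 1} and {3, 4} are far apart (0.1),
// but the bridge page 2 is similar (0.85) to every other page.
const bridgeSim = [
  [1, 0.95, 0.85, 0.1, 0.1],
  [0.95, 1, 0.85, 0.1, 0.1],
  [0.85, 0.85, 1, 0.85, 0.85],
  [0.1, 0.1, 0.85, 1, 0.95],
  [0.1, 0.1, 0.85, 0.95, 1],
];
const bridgeLabels = singleLinkageLabels(bridgeSim, 0.8);
const bridgeClusterCount = new Set(bridgeLabels).size;
```

Here all five pages end up in a single cluster even though pages 0 and 3 have a similarity of only 0.1. Under complete linkage the two templates cannot merge, since the pair (0, 3) would have to clear the threshold as well.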

Add "medoid" and "Murtagh" to the cspell dictionary for the new file's JSDoc.

@YusukeHirao requested a review from yusasa16 as a code owner on July 4, 2026 02:05
@cursor

cursor Bot commented Jul 4, 2026

Bugbot is not enabled for your account, so this pull request was not reviewed.

Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs.

@YusukeHirao merged commit cdddd1a into dev on Jul 4, 2026
6 checks passed
@YusukeHirao deleted the feat/page-cluster-resolve-structural-cluster-keys branch on July 4, 2026 02:13